Finding Similar Books With Embeddings

I asked the system for "70% this machine learning textbook + 30% this philosophy book" and it returned a book on computational epistemology that I'd forgotten I owned. The linear interpolation shouldn't work: averaging two points in a 512-dimensional space seems like it should produce noise. But the embedding space has enough structure that weighted combinations of book vectors land near real books that bridge both domains.

That query is a parlor trick, but it convinced me the underlying representation was capturing something real.

How to Embed a Book

You can't feed a whole novel into Google's Universal Sentence Encoder: the model expects short text. My approach: sample 120 sentences from each book, embed each one into a 512-dimensional vector, and take the median of those 120 vectors as the book's representation.

pipeline:
  text_extraction: PyPDF2/pdfminer (PDF), ebooklib+BeautifulSoup (EPUB)
  sampling: 120 sentences from middle pages (>12 pages: middle 10)
  embedding: Universal Sentence Encoder (512 dimensions, TF Hub)
  aggregation: median vector

120 sentences is enough to capture a book's voice and subject matter without the computation dragging. Median instead of mean because the median is robust to outliers: if a physics textbook has one chapter about the author's childhood, those sentences won't pull the median the way they'd pull the mean. For books longer than 12 pages, sampling only from the middle 10 avoids front matter (copyright notices, tables of contents, dedications) and back matter (indices, bibliographies), none of which represents the actual content.
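
As a rough sketch of the sampling and aggregation steps (not the exact code behind the tool), the functions below pick the middle pages, sample sentences, and reduce sentence embeddings to one book vector. The function names, the uniform random sampling, and the element-wise reading of "median" are assumptions made for illustration; the embedding step itself is left out, so the toy vectors stand in for Universal Sentence Encoder output.

```python
import random

import numpy as np


def middle_pages(pages, keep=10, threshold=12):
    """Return the middle `keep` pages when a book has more than `threshold` pages."""
    if len(pages) <= threshold:
        return list(pages)
    start = (len(pages) - keep) // 2
    return list(pages[start:start + keep])


def sample_sentences(sentences, n=120, seed=0):
    """Pick up to `n` sentences uniformly at random (assumed sampling scheme)."""
    sentences = list(sentences)
    if len(sentences) <= n:
        return sentences
    return random.Random(seed).sample(sentences, n)


def book_vector(sentence_vectors):
    """Element-wise median: (n_sentences, dim) -> (dim,)."""
    return np.median(np.asarray(sentence_vectors, dtype=float), axis=0)


# Toy example: three "sentence embeddings" in 2 dimensions, one an outlier.
vectors = np.array([[0.1, 0.2], [0.2, 0.1], [9.0, 9.0]])
print(book_vector(vectors))   # stays near the two typical sentences
print(vectors.mean(axis=0))   # the mean is dragged toward the outlier
```

The last two lines show the outlier argument in miniature: one stray sentence moves the mean a long way and the median hardly at all.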

Sentence sampling is the design choice that matters most. The embedding model is fixed and the aggregation method matters somewhat, but which sentences go in determines what the book vector can represent at all.

Queries

Once every book is a point in vector space:

Closest books: rank the library by the sum of squared differences from the query book's vector (squared Euclidean distance), smallest first. Technical books cluster near other technical books on the same subject; fiction clusters by genre and tone without anyone telling the model what genre means.
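
A minimal sketch of that ranking, assuming the library is stored as a NumPy array with one book vector per row. The function name `closest_books` and the toy 2-dimensional vectors are made up for illustration.

```python
import numpy as np


def closest_books(query, library, k=5):
    """Indices of the k rows of `library` with the smallest sum of
    squared differences to `query`, nearest first."""
    library = np.asarray(library, dtype=float)
    query = np.asarray(query, dtype=float)
    sq_dist = np.sum((library - query) ** 2, axis=1)
    return np.argsort(sq_dist)[:k]


# Toy library of four "books"; row 0 is used as the query.
library = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, -0.1], [5.0, 5.0]])
print(closest_books(library[0], library, k=3))
```

The query book itself comes back first with distance zero, so a real interface would probably drop it from the results.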

Style similarity: maximize correlation between vectors rather than minimizing distance. Two books can be far apart in topic but close in sentence structure, vocabulary level, and rhetorical patterns.
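
One way to sketch this, assuming "correlation" means the Pearson correlation between two book vectors (the text doesn't pin down the exact variant). Pearson correlation ignores a constant offset and a positive scale factor, which is how two vectors can be far apart in distance and still score as similar.

```python
import numpy as np


def style_similarity(a, b):
    """Pearson correlation between two embedding vectors."""
    return float(np.corrcoef(a, b)[0, 1])


a = np.array([1.0, 2.0, 3.0, 4.0])
b = 10.0 + 2.0 * a                   # far from `a` in distance, same shape
c = np.array([4.0, 3.0, 2.0, 1.0])   # close to `a` in distance, reversed shape

print(style_similarity(a, b))  # close to +1
print(style_similarity(a, c))  # close to -1
```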

Representative sentence: for each book, find the single sentence whose embedding is closest to the book's median vector. It acts as an automatic summary sentence. The sentence the model picks is often not the one you'd choose, but it captures the book's feel in a way that a hand-written summary might miss.
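
A small sketch of the lookup, assuming the sampled sentences and their embeddings are kept side by side and that "closest" means smallest squared distance. The function name and the toy data are illustrative only.

```python
import numpy as np


def representative_sentence(sentences, sentence_vectors):
    """Return the sentence whose embedding is nearest the element-wise median."""
    vectors = np.asarray(sentence_vectors, dtype=float)
    center = np.median(vectors, axis=0)
    sq_dist = np.sum((vectors - center) ** 2, axis=1)
    return sentences[int(np.argmin(sq_dist))]


sentences = ["first sentence", "second sentence", "third sentence"]
vectors = [[0.0, 0.0], [1.0, 0.5], [10.0, -3.0]]
print(representative_sentence(sentences, vectors))
```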

Book mixing: the weighted average query described above.
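
The mixing query is just linear interpolation followed by a nearest-book search. Here is a sketch with made-up titles and 2-dimensional stand-in vectors; `mix_books` and `nearest_title` are hypothetical names.

```python
import numpy as np


def mix_books(vec_a, vec_b, weight_a=0.7):
    """Weighted average: weight_a of book A plus (1 - weight_a) of book B."""
    vec_a = np.asarray(vec_a, dtype=float)
    vec_b = np.asarray(vec_b, dtype=float)
    return weight_a * vec_a + (1.0 - weight_a) * vec_b


def nearest_title(query, titles, library):
    """Title of the book vector with the smallest squared distance to `query`."""
    library = np.asarray(library, dtype=float)
    sq_dist = np.sum((library - query) ** 2, axis=1)
    return titles[int(np.argmin(sq_dist))]


titles = ["ml_textbook", "philosophy_book", "bridging_book"]
library = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.4]])

query = mix_books(library[0], library[1], weight_a=0.7)
print(nearest_title(query, titles, library))
```

In this toy setup the mixed point lands nearer the third book than either source book, which is the behavior the opening anecdote describes.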

Categorization and Duplicates

Beyond search, the embeddings support Gaussian models for automatic categorization: fit a Gaussian to each category's book vectors, then classify new books by maximum likelihood. It's rough (with 512 dimensions and maybe a dozen books per category, covariance estimates are noisy, and a full covariance matrix is rank-deficient unless it is regularized or simplified), but it catches the obvious groupings.
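
A sketch of the idea under one simplifying assumption that is mine, not necessarily what the tool does: each category gets a diagonal-covariance Gaussian (one variance per dimension, plus a small floor `eps`), which sidesteps the problem of estimating a full 512 by 512 covariance from a dozen books. Category names and vectors are toy data.

```python
import numpy as np


def fit_gaussian(vectors, eps=1e-3):
    """Per-dimension mean and variance; `eps` keeps variances away from zero."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors.mean(axis=0), vectors.var(axis=0) + eps


def log_likelihood(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian."""
    x = np.asarray(x, dtype=float)
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))


def classify(x, models):
    """Name of the category whose Gaussian gives x the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(x, *models[name]))


categories = {
    "physics": [[1.0, 0.0], [1.2, 0.1], [0.9, -0.1]],
    "fiction": [[-1.0, 2.0], [-1.1, 2.2], [-0.8, 1.9]],
}
models = {name: fit_gaussian(vecs) for name, vecs in categories.items()}
print(classify([1.1, 0.0], models))
```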

Duplicate detection was an unplanned bonus. My library had several books under different filenames whose embeddings were nearly identical. A simple distance threshold flags these; no content hashing is needed.
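
A sketch of the threshold check, comparing every pair of book vectors. The filenames are hypothetical and the threshold value is a placeholder; a usable cutoff would have to be tuned on a real library.

```python
from itertools import combinations

import numpy as np


def find_duplicates(names, vectors, threshold=1e-3):
    """Pairs of names whose vectors differ by less than `threshold`
    in sum of squared differences."""
    vectors = np.asarray(vectors, dtype=float)
    pairs = []
    for i, j in combinations(range(len(names)), 2):
        if np.sum((vectors[i] - vectors[j]) ** 2) < threshold:
            pairs.append((names[i], names[j]))
    return pairs


names = ["book_a.epub", "book_a_copy.pdf", "book_b.pdf"]
vectors = [[0.500, 0.200], [0.501, 0.199], [-0.300, 0.900]]
print(find_duplicates(names, vectors))
```

The pairwise loop is quadratic in the number of books, which is fine for a personal library but would need an index for a much larger one.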

The tool runs over my full library now. When I finish a book and want something with a similar feel, a single query gives me a ranked list in seconds.